
    Improving Visual Representation Learning through Perceptual Understanding

    We present an extension to masked autoencoders (MAE) which improves the representations learnt by the model by explicitly encouraging the learning of higher scene-level features. We do this by (i) introducing a perceptual similarity term between generated and real images, and (ii) incorporating several techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation. The combination of these results not only in better pixel reconstruction but also in representations which appear to better capture higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance when used for downstream tasks, outperforming previous methods. We achieve 78.1% top-1 accuracy with linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without the use of additional pre-trained models or data.
    Comment: v2: adds additional details on MSG-MAE. In Proc. CVPR 202
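The loss structure described above, a pixel reconstruction term combined with a perceptual similarity term computed on features rather than raw pixels, can be sketched in a few lines. This is a minimal NumPy illustration only: `toy_features`, the weight `lam`, and the pooling they perform are stand-ins for the learned network features and weighting the paper actually uses.

```python
import numpy as np

def pixel_loss(recon, target):
    """Standard MAE reconstruction objective: mean squared error on pixels."""
    return np.mean((recon - target) ** 2)

def toy_features(img):
    """Stand-in feature extractor (per-column mean and std).

    The actual method compares activations of a network; this toy
    pooling only illustrates the structure of the combined loss.
    """
    return np.stack([img.mean(axis=0), img.std(axis=0)])

def perceptual_loss(recon, target):
    """Distance measured in feature space rather than pixel space."""
    return np.mean((toy_features(recon) - toy_features(target)) ** 2)

def total_loss(recon, target, lam=0.1):
    """Pixel reconstruction plus a weighted perceptual similarity term."""
    return pixel_loss(recon, target) + lam * perceptual_loss(recon, target)
```

A perfect reconstruction drives both terms to zero; any mismatch in feature statistics adds a penalty on top of the pixel error.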

    Workshop Report on Managing Solar Radiation

    The basic concept of managing Earth's radiation budget is to reduce the amount of incoming solar radiation absorbed by the Earth so as to counterbalance the heating of the Earth that would otherwise result from the accumulation of greenhouse gases. The workshop did not seek to decide whether or under what circumstances solar radiation management should be deployed, or which strategies or technologies might be best if it were deployed. Rather, the workshop focused on defining what kinds of information might be most valuable in allowing policy makers to address more knowledgeably the various options for solar radiation management.

    AXES at TRECVID 2012: KIS, INS, and MED

    The AXES project participated in the interactive instance search task (INS), the known-item search task (KIS), and the multimedia event detection task (MED) for TRECVid 2012. As in our TRECVid 2011 system, we used nearly identical search systems and user interfaces for both INS and KIS. Our interactive INS and KIS systems focused this year on using classifiers trained at query time with positive examples collected from external search engines. Participants in our KIS experiments were media professionals from the BBC; our INS experiments were carried out by students and researchers at Dublin City University. We performed comparatively well in both experiments. Our best KIS run found 13 of the 25 topics, and our best INS runs outperformed all other submitted runs in terms of P@100. For MED, the system presented was based on a minimal number of low-level descriptor types, each chosen to be as high-dimensional as computationally feasible. These descriptors are aggregated to produce high-dimensional video-level signatures, which are used to train a set of linear classifiers. Our MED system achieved the second-best score of all submitted runs in the main track, and the best score in the ad-hoc track, suggesting that a simple system based on state-of-the-art low-level descriptors can give relatively high performance. This paper describes in detail our KIS, INS, and MED systems and the results and findings of our experiments.
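The MED pipeline outlined above, local descriptors pooled into a fixed-length video-level signature that a linear classifier then scores, can be sketched as follows. Average pooling stands in for the actual aggregation, and the classifier `weights` would come from training; both are illustrative assumptions, not the system's implementation.

```python
import numpy as np

def video_signature(frame_descriptors):
    """Aggregate per-frame local descriptors into one fixed-length
    video-level vector (average pooling as a simple stand-in), then
    L2-normalise so videos of different lengths are comparable."""
    sig = np.mean(frame_descriptors, axis=0)
    return sig / (np.linalg.norm(sig) + 1e-12)

def score_event(signature, weights, bias=0.0):
    """Linear classifier score for one event class."""
    return float(signature @ weights + bias)
```

Because every video maps to a vector of the same dimensionality, one linear classifier per event class suffices, regardless of video length.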

    The AXES research video search system

    We will demonstrate a multimedia content information retrieval engine developed for audiovisual digital libraries targeted at academic researchers and journalists. It is the second of three multimedia IR systems being developed by the AXES project. The system brings together traditional text IR and state-of-the-art content indexing and retrieval technologies to allow users to search and browse digital libraries in novel ways. Key features include: metadata and ASR search and filtering, on-the-fly visual concept classification (categories, faces, places, and logos), and similarity search (instances and faces).

    The AXES submissions at TrecVid 2013

    The AXES project participated in the interactive instance search task (INS), the semantic indexing task (SIN), the multimedia event recounting task (MER), and the multimedia event detection task (MED) for TRECVid 2013. Our interactive INS system focused this year on using classifiers trained at query time with positive examples collected from external search engines. Our INS experiments were carried out by students and researchers at Dublin City University. Our best INS runs performed on par with the top-ranked INS runs in terms of P@10 and P@30, and around the median in terms of mAP. For SIN, MED, and MER, we used systems based on state-of-the-art local low-level descriptors for motion, image, and sound, as well as high-level features to capture speech and text in the audio and visual streams respectively. The low-level descriptors were aggregated by means of Fisher vectors into high-dimensional video-level signatures, and the high-level features were aggregated into bag-of-words histograms. Using these features we trained linear classifiers, and used early and late fusion to combine the different features. Our MED system achieved the best score of all submitted runs in the main track, as well as in the ad-hoc track. This paper describes in detail our INS, MER, and MED systems and the results and findings of our experiments.
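The late-fusion step mentioned above, combining classifier scores from separate feature channels (motion, image, sound, speech, text), can be as simple as a weighted average of per-channel scores. A minimal sketch, assuming scalar per-channel scores; in practice the fusion weights would be tuned on validation data rather than left uniform.

```python
import numpy as np

def late_fusion(scores_per_feature, weights=None):
    """Combine per-feature classifier scores for one video.

    scores_per_feature: list of scalar scores, one per feature
    channel (e.g. motion, image, sound). Uniform weights are used
    unless a weight vector is supplied.
    """
    scores = np.asarray(scores_per_feature, dtype=float)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    weights = np.asarray(weights, dtype=float)
    return float(scores @ weights)
```

Early fusion, by contrast, would concatenate the feature vectors before training a single classifier; the two strategies are complementary and are combined in the system described.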

    On-the-fly visual category search in web-scale image collections

    This thesis tackles the problem of large-scale visual search for categories within large collections of images. Given a textual description of a visual category, such as 'car' or 'person', the objective is to retrieve images containing that category from the corpus quickly and accurately, and without the need for auxiliary meta-data or, crucially and in contrast to previous approaches, expensive pre-training. The general approach to identifying different visual categories within a dataset is to train classifiers over features extracted from a set of training images. The performance of such classifiers relies heavily on sufficiently discriminative image representations, and many methods have been proposed which involve aggregating local appearance features into rich bag-of-words encodings. We begin by conducting a comprehensive evaluation of the latest such encodings, identifying best-of-breed practices for training powerful visual models using these representations. We also contrast these methods with the latest Convolutional Network (ConvNet) based features, thus developing a state-of-the-art architecture for large-scale image classification. Following this, we explore how a standard classification pipeline can be adapted for use in a real-time setting. One of the major issues, particularly with bag-of-words based methods, is the high dimensionality of the encodings, which causes ranking over large datasets to be prohibitively expensive. We therefore assess different methods for compressing such features, and further propose a novel cascade approach to ranking which both reduces ranking time and improves retrieval performance. Finally, we explore the problem of training visual models on-the-fly, making use of visual data dynamically collected from the web to train classifiers on demand. On this basis, we develop a novel GPU architecture for on-the-fly visual category search which is capable of retrieving previously unknown categories over unannotated datasets of millions of images in just a few seconds.
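The cascade ranking idea described in the abstract, a fast pass over cheap compressed features that selects a shortlist which is then re-ranked with the full encoding, might look roughly like this. The two-stage structure and dot-product scoring are assumptions for illustration; the thesis's actual compression and ranking functions differ.

```python
import numpy as np

def cascade_rank(query_cheap, query_full, db_cheap, db_full, shortlist=100):
    """Two-stage cascade ranking.

    Pass 1 scores the whole corpus with cheap (compressed) features;
    pass 2 re-ranks only the resulting shortlist with the full,
    expensive features. Returns database indices in final rank order.
    """
    cheap_scores = db_cheap @ query_cheap               # pass 1: whole corpus
    shortlist_idx = np.argsort(-cheap_scores)[:shortlist]
    full_scores = db_full[shortlist_idx] @ query_full   # pass 2: shortlist only
    order = np.argsort(-full_scores)
    return shortlist_idx[order]
```

The expensive features touch only `shortlist` items instead of the whole corpus, which is where the ranking-time saving comes from.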

    The devil is in the details: an evaluation of recent feature encoding methods (Chatfield et al.)

    A large number of novel encodings for bag-of-visual-words models have been proposed in the past two years to improve on the standard histogram of quantized local features. Examples include locality-constrained linear encoding [23], improved Fisher encoding [17], super vector encoding [27], and kernel codebook encoding [20]. While several authors have reported very good results on the challenging PASCAL VOC classification data by means of these new techniques, differences in the feature computation and learning algorithms, missing details in the description of the methods, and different tuning of the various components make it impossible to compare these methods directly and hard to reproduce the reported results. This paper addresses these shortcomings by carrying out a rigorous evaluation of these new techniques by (1) fixing the other elements of the pipeline (features, learning, tuning); (2) disclosing all the implementation details; and (3) identifying both those aspects of each method which are particularly important to achieve good performance, and those aspects which are less critical. This allows a consistent comparative analysis of these encoding methods. Several conclusions drawn from our analysis cannot be inferred from the original publications.
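The baseline these encodings improve on, a histogram of quantized local features, can be written directly: hard-assign each descriptor to its nearest codebook centre and count the assignments. A minimal sketch for reference; the encodings evaluated in the paper replace this hard assignment/pooling step with softer or higher-order statistics.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """L1-normalised bag-of-visual-words histogram.

    Each local descriptor is hard-assigned to its nearest codebook
    centre (squared Euclidean distance) and the assignments counted.
    """
    # Pairwise squared distances, shape (n_descriptors, n_centres).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

The image-level vector has one bin per codebook centre, so its dimensionality is fixed by the codebook size regardless of how many local features an image yields.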